Hypothesis: The transcriptome of bone marrow derived mesenchymal stem cells (MSCs) transplanted into a spinal cord injury environment is altered as compared to MSCs transplanted into uninjured spinal cord.
Experimental setup & sequencing: mCherry+MSCs were transplanted into injured (75 kdyn contusion injury, 24h post SCI) or uninjured spinal cord using a glas capillary pipette. C57BL/6J female mice were utilized. At 7 days post transplantation the mCherry+MSCs were FACSed. Injured and uninjured (when applicable) were used for setting positive mCherry gate. Total RNA, digested of DNase, was isolated and sequenced (125 cycles paired-end) in two lane using the HiSeq2500 system and v4 sequencing chemistry (Illumina Inc.) performed by the SNP&SEQ Technology Platform (Stockholm, Sweden).
Data analysis: A read count matrix was obtained from the sequencing core facility. Data was analyed using the edgeR and limma packages (both available through bioconductor.org) using R version 3.4.1. Two additional key packages used were ggplot2 and data.table.
| Characteristic | Value |
|---|---|
| Samples (n): | 22 |
| Groups (n): | 3 |
| Unique ENSEMBL IDs (n): | 46078 |
Background: The expression of a gene must reach a certain threshold for it to be translated into a protein. Translation into a protein is a prerequiste for a gene to have any biological function. Thus, genes with low number of read counts across samples are probably not differentially expressed and should be removed. However, a greater sequencing depth (i.e. a larger library size) will result in a higher read count, thus introducing a bias to the DGE analysis. Therefore, prior to filtering, raw counts are transformed into counts per million (CPM) which accounts for the difference in library size.
Transformation: Raw counts are transformed into CPM by dividing each count with the sum of the column (i.e. the library size) and then multiplying by 1e6. Log-CPM is calculated by taking log2 (raw count + 0.25).
Filtering: Lowly expressed genes are removed from the count matrix by filtering. A gene is defined as highly expressed if CPM>1 for at least three samples. Three samples was chosen since it is equal to the smallest group size (which is equal between groups in this case).
Fig 1. Density of log-CPM values pre -and post filtering
Fig 1. Figure reports the density of log-CPM for every sample (by color) pre -and post filtering of lowly expressed genes. Filtering is conducted using log-CPM values. Vertical dashed line represents the cut off (log-CPM=0). The figure shows a distinct shift of the density from below the threshold (Fig 1A) to above the threshold (Fig 1B).
Background: The read count is affected by: 1) the gene expression and 2) the sequencing depth. The sequencing depth equals the library size. The library size is defined as the sum of counts for each sample (i.e. column sum). The counts per sample represent the relative abundance of each gene. Highly expressed genes can consume a substantial proportion of the library size thus making the other genes seem underexpressed. Therefore, normalization is conducted in order to ensure that the distribution of the expression is similar for each sample. All samples should have a smiliar range and distribution of expression (log-CPM).
Normalization: Scaling factors are calculated using the trimmed mean of M-values (TMM). The algorithm finds a set of scaling factors which minimizes the log-fold change between the samples. Scaling factors >1 downscale the counts while scaling factors <1 scale the counts upwards. The effective library size is then obtained by taking the product of the original library size and the scaling factor for each sample respectively.
Fig 2. Distribution of log-CPM pre -and post normalization
Fig 2. Figure reports the distribution of gene expression (log-CPM) for each sample. Fig 2A reports the distribution prior to normalization while Fig 2B reports the distribution following normalization of library sizes using trimmed means of M-values. Boxplots are based on all log-CPM values while points represent a random sample of 1e4 observations (due to processing time issues). The difference in the distribution of log-CPM using original and effective library sizes is minor but adjusted for.
Principal component analysis (PCA): A PCA aims at producing a low-dimensional representation of the dataset. The principal components are each normalized linear combinations of a set of features constructed with loadings with the intention to achieve maximal variance. Normalized means that the sum of squared loadings equals one. Furthermore, the principal components are constructed to be uncorrelated to each other.
Fig 3. Variance explained by principal components based on the 500 genes with highest variance
Fig 3. Figure reports the proportion variance explained by each principal component. Fig 3A reports the proportional variance explained by each component while Fig 3B reports the cumulative variance explained by the components. It is obvious that the first principal component explains the majority of the variance while the remaining components explain only a small portion of the variance.
Fig 4. Estimation of distribution of the variance explained by the principal components
Fig 4. Figure reports the distribution of bootstrapped proportion of variance explained by each principal component. Bootstrap is conducted using the 500 genes with the highest variance. Fig 4A reports the distribution for the first principal component while Fig 4B reports the distribution for the remaining eleven components (due to differences in magnitude). Shaded region in Fig 4A represents a bootstrapped nonparametric 95 % confidence interval. The histograms show narrow distributions which confirms the observations and conclusion in Fig 3.
Table 2. Upper and lower bounds (95 % confidence interval) for the proportion of variance explained by principal component 1 to 11
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Upper bound: | 0.958 | 0.005 | 0.004 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| Lower bound: | 0.967 | 0.008 | 0.006 | 0.003 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
Fig 5. Multidimensional scaling plot of the 500 genes with the highest variance
Fig 5. Figure reports the distribution of the samples along the first and second principal component. Study groups are indicated with ellipses. Ellipses represent a 95 % confidence interval based on the samples within each group. There is a distinct separation between the four study groups. Uninjured samples separate from the three other study groups which contain animals exposed to SCI. The syngeneic and allogenic groups do form an “venn diagram like” pattern in relation to the injury only group in the sense that they both have one animal which overlaps with the injury control study group.
Fig 6. K means clustering (4 groups) of 1000 boostrap replicates for the first -and second principal component for the 500 genes with the highest variance
Fig 6. Figure reports the first and second principal component for 1000 bootstrapped replicates per sample based on the 500 genes with highest variance. K means clustering with 4 groups and 20 starts was conducted on both the bootstrapped data. Ellipses are centered around the mean value of the botstrapped values for each cluster and represent a 95 % confidence interval for the data in each cluster. The clustering confirms the hypothesis postulated in Fig 5.
Fig 7. Top 10 positive and negative loadings for the first -and second principal component
Fig 7. Figure reports the top 10 positive and negative loadings for the first -and second principal component.
Fig 8. Hierarchical clustering of samples based on 1000 genes with highest variance
Fig 8. Figure reports a dendrogram which illustrates hierarchical clustering of samples based on the 1000 genes with the highest variance.
Fig 9. Cox-Reid profile-adjusted likelihood estimated tagwise, common and trended dispersions
Fig 9. Figure reports tagwise, common and trended dispersions.
Background: The variance of raw counts is not independent of its mean. Furthermore, the variance of raw counts increases with the count size (opposite is true for log-counts).
Voom transformation: Acronym for mean-variance modelling at the observational level. The mean-variance relationship is estimated in the data which is then used for computing precision weights for each gene. The precision weights are then implemented in the linear modelling in order to adjust for heteroscedasticity. Estimation of precision weights is done as follows: 1) log-CPM=log2(raw count + 0.5), 2) a linear modelled is fitted to each gene, 3) lowess function is fitted to the scatterplot between average log-CPM and sqrt(st.dev), 4) the variance of the log-CPM for each gene is predicted using the lowess trend line 5) the inverse of the variance for each log-CPM value is the precision weight.
Empirical Bayes moderation of standard error: Ranks genes in order of evidence of differential expression.
Fig 10. Mean-variance relationship pre -and post voom transformation
Fig 10. Figure reports the mean-variance relationship pre -and post application of the voom function. Fig 9A reports the average log-CPM against the quarter root of the variance. Fig 9B reports average log-CPM against the log2(st.dev). Blue line reports the average log2(st.dev). The red line is a linear trend fitted to the black dots. Each black dot represents a gene. Fig 9A illustrates that the variance is decresing when the average expression is increasing. In Fig 9B the dependency is removed and the mean variance is unchanged when the average expression increases.
Determining significant gene differential expression for a contrast: A simple Bayesian model is used for moderating standard errors across genes (squeeze them towards a common value) and producing moderated t-statistics. These moderated t-statistics are used for significance analysis. Moderated t-statistics have a higher degree of freedom in comparison to usual t-statistics due to the increase in reliability resulting from the smoothening of standard errors. P-values are adjusted for multiple testing using Benjamini and Hochberg’s method to control for false discovery rate (FDR).
Fig 11. Number of differentially expressed genes for each contrast
Fig 11. Figure reports venn diagrams containing the number of differentially expressed genes for each contrast.
Table 3. Number of differentially over -and under-expressed genes for each contrast
| MSC[SCI]-MSC[naive] | MSC[SCI]-MSC[invitro] | MSC[naive]-MSC[invitro] | |
|---|---|---|---|
| Downregulated: | 5248 | 5301 | 909 |
| No change | 5148 | 5296 | 13711 |
| Upregulated: | 4888 | 4687 | 664 |
Fig 12. Mean difference -and volcano plot
Fig 12. Figure 11A reports a mean-difference plot which illustrates the number of over -and under expressed genes. Threshold is set at log2(fold change) +/-1 (blue lines). Blue dots represents genes above or below the log-fold change thresholds while red dots represent those genes which are above/below the thresholds and are significantly differentially expressed. Fig 11B is a volcano plot which reports the number of significantly over -and underexpressed genes (marked with red). Data in both plots is for the contrast “MSC[SCI] vs MSC[naive]”.
Table 4. 10 most significantly up -and downregulated differentially expressed genes
| Gene | log2(fold change) | P-value (adjusted) | Gene | log2(fold change) | P-value (adjusted) |
|---|---|---|---|---|---|
| Fos | -6.63 | 0 | Tfrc | 2.67 | 0 |
| Msr1 | -11.85 | 0 | Psat1 | 2.22 | 0 |
| Fcgr2b | -11.50 | 0 | Cars | 1.84 | 0 |
| Il7r | -9.50 | 0 | Aars | 1.61 | 0 |
| Rgs1 | -13.35 | 0 | Slc6a9 | 3.01 | 0 |
| Ms4a6c | -11.86 | 0 | 3110039I08Rik | 2.34 | 0 |
| Fosb | -5.22 | 0 | Slc7a11 | 3.24 | 0 |
| Lyz2 | -14.64 | 0 | Polr1a | 1.27 | 0 |
| Adgre1 | -11.97 | 0 | Hspa9 | 1.67 | 0 |
| Cd300lf | -10.91 | 0 | Slc7a1 | 2.09 | 0 |
Gene Ontology (GO): A major bioinformatics initiative which offers a computational representation of the biological function of genes at molecular, cellular and tissue level. This tool enables one to annotate genes with their function.
KEGG (Kyoto Encyclopedia of Genes and Genomes): KEGG pathway are manually drawn pathway maps which represent the current knowledge within metabolism, cellular processes and many more.
Table 4. GO terms and KEGG pathways
| Type | Term | Ont | N | Up | Down | P.Up | P.Down |
|---|---|---|---|---|---|---|---|
| GO | immune response | BP | 828 | 166 | 507 | 1 | 0 |
| GO | immune system process | BP | 1566 | 408 | 830 | 1 | 0 |
| GO | defense response | BP | 858 | 189 | 507 | 1 | 0 |
| GO | membrane part | CC | 3744 | 1025 | 1650 | 1 | 0 |
| GO | intrinsic component of membrane | CC | 2919 | 769 | 1309 | 1 | 0 |
| GO | regulation of immune system process | BP | 857 | 214 | 476 | 1 | 0 |
| GO | integral component of membrane | CC | 2843 | 746 | 1272 | 1 | 0 |
| GO | lysosome | CC | 389 | 67 | 255 | 1 | 0 |
| GO | lytic vacuole | CC | 389 | 67 | 255 | 1 | 0 |
| GO | regulation of immune response | BP | 454 | 90 | 286 | 1 | 0 |
| Type | Pathway | N | Up | Down | P.Up | P.Down |
|---|---|---|---|---|---|---|
| KEGG | Lysosome | 112 | 6 | 85 | 1.0000000 | 0 |
| KEGG | Rheumatoid arthritis | 64 | 2 | 53 | 1.0000000 | 0 |
| KEGG | Cytokine-cytokine receptor interaction | 149 | 26 | 99 | 0.9999952 | 0 |
| KEGG | Leishmaniasis | 60 | 8 | 48 | 0.9998808 | 0 |
| KEGG | Inflammatory bowel disease (IBD) | 44 | 2 | 38 | 0.9999996 | 0 |
| KEGG | Phagosome | 132 | 17 | 86 | 1.0000000 | 0 |
| KEGG | NOD-like receptor signaling pathway | 144 | 28 | 91 | 0.9999192 | 0 |
| KEGG | Tuberculosis | 144 | 18 | 91 | 1.0000000 | 0 |
| KEGG | Staphylococcus aureus infection | 38 | 2 | 33 | 0.9999956 | 0 |
| KEGG | Intestinal immune network for IgA production | 28 | 1 | 25 | 0.9999873 | 0 |
Background: Purpose of hierarchical clustering in combination with heatmap is to find subset of genes which explain the difference between study groups.
Hierarchical clustering: An unsupervised statistical learning method which aims at clustering the data in a not pre-determined amount of clusters. In agglomerative hierarchical clustering the tree is built from the terminal nodes (leaves) towards the root. A dendrogram is utilized to displayed the clusterings. A heatmap is added to the hierarchical clustering to enable the identification of gene clusters which might explain the hierarchical clustering.
Fig 13. Hierarchical clustering of samples together with heatmap of significantly differentially expressed genes
Fig 14. Figure reports a heat map with hierarchical clustering (indicated with dendrograms) using log-CPM values. Only significantly differentially expressed genes are included (and genes with NA symbols were removed).
Fig 15. Multiscale bootstrap resampling of hierarchical clustering
Fig 15. Figure reports hierarchical clustering using multiscale boostrap resampling with 1000 replicates. Numbers with red color are the approximately unbiased (AU) p-values while blue numbers represent bootstrap probability (BP). AU >95 and BP >70 are considered to be significant clusters. Red boxes indicate clusters with AU >95.
Fig 16. Unsupervised clustering using the 1000 genes with the highest F value (between all contrasts)
Fig 16. Figure reports a hierarchical clustering based on the 1000 genes with the highest F statistica (higher F statistica indicate a larger difference between contrasts).
Camera: Performs a competitive gene set test which accounts for inter-gene correlations. Camera evaluates if a set of genes is highly ranked relative to other genes in terms of differential expression.
Barcode plot: This function plots two sets of genes in a ranked list of statistics. Statistics are ranked left to right from smallest to largest. Shaded region in the middle of the plot represents the ranked statistics while the vertical bars reports the positions of the specified subsets. The enrichment worm (line) shows the relative enrichment of the vertical bars in each part of the plot.
Table 5. Significant and selected gene sets from MSigDB libraries
Fig 17. Barcode plot
1. Annotation: Genes are annotated with gene name using their respective ENSEMBL ID.
2. Transformation: Read count matrix is transformed into log-CPM using original library sizes.
3. Filtering: Read count matrix is filtered using log-CPM values (>0 for at least 3 samples).
4. Normalization: Effective library sizes are calculated using the library sizes for the filtered read count matrix and the trimmed mean of M values (TMM) approach.
5. Transformation: Filtered read count matrix is transformed into log-CPM matrix.
6. PCA: Conducted for the 500 genes with highest variance. Proportional variance explained, MDS and loading plots are created.
7. Design matrix: A dummy matrix which indicates which group each sample belongs.
8. Contrast matrix: Contrasts are the group comparisons of interest.
9. Voom transformation: Estimate precision weights for linear modelling to remove dependency between the variance and trhe mean.
10. Linear modelling: Linear modelling using precision weights followed by an empirical Bayes moderation.
11. Differentially expressed genes: Moderated t-statistics are used for determining significantly expressed genes for each contrast. Results are displayed with venn diagrams, mean-difference -and volcano plot and a summary table.
12. Analysis/interpretation: Using hierarchical clustering, heatmap, gene ontology and KEGG enrichment analysis and gene set analysis the difference between the study groups is sought for.
[1] R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL htts://www.R-project.
[2] Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.
[3] Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
[4] Law CW, Alhamdoosh M, Su S et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 1; referees: 3 approved]. F1000Research 2016, 5:1408.
This analysis was conducted on:
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=sv_SE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=sv_SE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=sv_SE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid parallel stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] Rtsne_0.13
## [2] caret_6.0-78
## [3] lattice_0.20-35
## [4] xml2_1.2.0
## [5] XML_3.98-1.3
## [6] RCurl_1.95-4.10
## [7] bitops_1.0-6
## [8] pander_0.6.1
## [9] knitr_1.20
## [10] pvclust_2.0-0
## [11] boot_1.3-20
## [12] rafalib_1.0.0
## [13] ggrepel_0.7.0
## [14] ellipse_0.4.1
## [15] VennDiagram_1.6.19
## [16] futile.logger_1.4.3
## [17] gplots_3.0.1
## [18] RColorBrewer_1.1-2
## [19] colorspace_1.3-2
## [20] gridExtra_2.3
## [21] cowplot_0.9.2
## [22] ggplot2_2.2.1
## [23] data.table_1.10.4-3
## [24] Mus.musculus_1.3.1
## [25] TxDb.Mmusculus.UCSC.mm10.knownGene_3.4.0
## [26] org.Mm.eg.db_3.4.1
## [27] GO.db_3.4.1
## [28] OrganismDbi_1.18.1
## [29] GenomicFeatures_1.28.5
## [30] GenomicRanges_1.28.6
## [31] GenomeInfoDb_1.12.3
## [32] AnnotationDbi_1.38.2
## [33] IRanges_2.10.5
## [34] S4Vectors_0.14.7
## [35] Biobase_2.36.2
## [36] BiocGenerics_0.22.1
## [37] edgeR_3.18.1
## [38] limma_3.32.10
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.2 plyr_1.8.4
## [3] lazyeval_0.2.1 splines_3.4.1
## [5] BiocParallel_1.10.1 digest_0.6.15
## [7] foreach_1.4.4 BiocInstaller_1.26.1
## [9] htmltools_0.3.6 gdata_2.18.0
## [11] magrittr_1.5 memoise_1.1.0
## [13] sfsmisc_1.1-2 recipes_0.1.2
## [15] Biostrings_2.44.2 gower_0.1.2
## [17] dimRed_0.1.0 matrixStats_0.53.1
## [19] blob_1.1.0 dplyr_0.7.4
## [21] graph_1.54.0 bindr_0.1.1
## [23] survival_2.41-3 iterators_1.0.9
## [25] glue_1.2.0 DRR_0.0.3
## [27] gtable_0.2.0 ipred_0.9-6
## [29] zlibbioc_1.22.0 XVector_0.16.0
## [31] DelayedArray_0.2.7 kernlab_0.9-25
## [33] ddalpha_1.3.1.1 DEoptimR_1.0-8
## [35] scales_0.5.0 futile.options_1.0.0
## [37] DBI_0.8 Rcpp_0.12.16
## [39] foreign_0.8-69 bit_1.1-12
## [41] lava_1.6 prodlim_1.6.1
## [43] pkgconfig_2.0.1 nnet_7.3-12
## [45] locfit_1.5-9.1 labeling_0.3
## [47] tidyselect_0.2.4 rlang_0.2.0
## [49] reshape2_1.4.3 munsell_0.4.3
## [51] tools_3.4.1 RSQLite_2.0
## [53] broom_0.4.3 evaluate_0.10.1
## [55] stringr_1.3.0 yaml_2.1.18
## [57] ModelMetrics_1.1.0 bit64_0.9-7
## [59] robustbase_0.92-8 caTools_1.17.1
## [61] purrr_0.2.4 bindrcpp_0.2
## [63] RBGL_1.52.0 nlme_3.1-131
## [65] RcppRoll_0.2.2 biomaRt_2.32.1
## [67] compiler_3.4.1 tibble_1.4.2
## [69] stringi_1.1.7 highr_0.6
## [71] Matrix_1.2-11 psych_1.7.8
## [73] pillar_1.2.1 rtracklayer_1.36.6
## [75] R6_2.2.2 KernSmooth_2.23-15
## [77] codetools_0.2-15 lambda.r_1.2
## [79] MASS_7.3-47 gtools_3.5.0
## [81] assertthat_0.2.0 CVST_0.2-1
## [83] SummarizedExperiment_1.6.5 rprojroot_1.3-2
## [85] withr_2.1.1 GenomicAlignments_1.12.2
## [87] Rsamtools_1.28.0 mnormt_1.5-5
## [89] GenomeInfoDbData_0.99.0 rpart_4.1-11
## [91] timeDate_3043.102 tidyr_0.8.0
## [93] class_7.3-14 rmarkdown_1.9
## [95] lubridate_1.7.3